$$
\frac{\partial\,\mathrm{bool}(x)}{\partial x} =
\begin{cases}
1, & \text{if } |x| \le 1 \\
0, & \text{otherwise.}
\end{cases}
\tag{5.29}
$$
By applying the bool(·) function, the elements of the attention weight with lower values are binarized to 0, so the resulting entropy-maximized attention weight can filter out the crucial elements. The proposed Bi-Attention structure is finally expressed as
$$
\mathbf{B}_A = \mathrm{bool}(\mathbf{A}) = \mathrm{bool}\!\left(\frac{1}{\sqrt{D}}\,\mathbf{B}_Q \otimes \mathbf{B}_K^{\top}\right),
\tag{5.30}
$$
$$
\text{Bi-Attention}(\mathbf{B}_Q, \mathbf{B}_K, \mathbf{B}_V) = \mathbf{B}_A \boxtimes \mathbf{B}_V,
\tag{5.31}
$$
where BV is the binarized value obtained by sign(V), BA is the binarized attention weight, and ⊠ is a well-designed Bitwise-Affine Matrix Multiplication (BAMM) operator composed of ⊗ and a bitshift, which aligns the training and inference representations and performs efficient bitwise computation.
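To make the structure concrete, below is a minimal PyTorch sketch of the bool(·) binarization with the clipped straight-through gradient of Eq. (5.29) and the Bi-Attention forward pass of Eqs. (5.30)–(5.31). It is illustrative rather than the authors' implementation: the thresholding of bool(·) at zero in the forward pass, the names BoolSTE, sign_ste, and bi_attention, and the replacement of the BAMM operator ⊠ by an ordinary matrix multiplication are all assumptions made here for clarity.

```python
import torch


class BoolSTE(torch.autograd.Function):
    """bool(x): 1 if x >= 0 else 0 (threshold at zero assumed here),
    with the clipped straight-through gradient of Eq. (5.29):
    pass the incoming gradient only where |x| <= 1."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


def sign_ste(x):
    # sign(·) binarization to ±1 with a plain straight-through gradient
    # (kept unclipped here for brevity).
    b = torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))
    return (x - x.detach()) + b.detach()


def bi_attention(q, k, v):
    """Bi-Attention of Eqs. (5.30)-(5.31) on binarized Q, K, V.

    The BAMM operator (⊠) is approximated by a plain matmul for
    illustration; in BiBERT it is realized with bitwise operations
    and a bitshift.
    """
    d = q.shape[-1]
    bq, bk, bv = sign_ste(q), sign_ste(k), sign_ste(v)
    a = bq @ bk.transpose(-1, -2) / d ** 0.5   # A = (1/sqrt(D)) BQ ⊗ BK^T
    ba = BoolSTE.apply(a)                      # binary attention weight, no softmax
    return ba @ bv                             # BA ⊠ BV (approximated)
```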
In a nutshell, the Bi-Attention structure maximizes the information entropy of the binarized attention weight (as Fig. 5.14(c) shows) to alleviate its immense information degradation and revive the attention mechanism. Bi-Attention also achieves greater efficiency since the softmax is excluded.
5.9.2 Direction-Matching Distillation
As an optimization technique based on element-level comparison of activations, distillation allows the binarized BERT to mimic the full-precision teacher model's intermediate activations. However, distillation causes a direction mismatch in the optimization of the fully binarized BERT baseline, leading to insufficient optimization and even harmful effects. To address the direction mismatch that occurs in the backward propagation of the fully binarized BERT baseline, the authors further proposed a Direction-Matching Distillation (DMD) scheme with apposite distilled activations and well-constructed similarity matrices to effectively utilize knowledge from the teacher, which optimizes the fully binarized BERT more accurately.
Their efforts first fall on reselecting the distilled activations for DMD: they distill the upstream query Q and key K instead of the attention score, which utilizes its knowledge while alleviating the direction mismatch. Besides, the authors also distill the value V to further cover all the inputs of the MHA. Then, similarity pattern matrices are constructed from the distilled activations, which can be expressed as
$$
\mathbf{P}_Q = \frac{\mathbf{Q} \times \mathbf{Q}^{\top}}{\|\mathbf{Q} \times \mathbf{Q}^{\top}\|}, \quad
\mathbf{P}_K = \frac{\mathbf{K} \times \mathbf{K}^{\top}}{\|\mathbf{K} \times \mathbf{K}^{\top}\|}, \quad
\mathbf{P}_V = \frac{\mathbf{V} \times \mathbf{V}^{\top}}{\|\mathbf{V} \times \mathbf{V}^{\top}\|},
\tag{5.32}
$$
where ∥·∥ denotes ℓ2 normalization. The corresponding PQT, PKT, PVT are constructed in the same way from the teacher's activations. The distillation loss is expressed as
$$
\ell_{\mathrm{distill}} = \ell_{\mathrm{DMD}} + \ell_{\mathrm{hid}} + \ell_{\mathrm{pred}},
\tag{5.33}
$$
$$
\ell_{\mathrm{DMD}} = \sum_{l \in [1, L]} \sum_{F \in \mathcal{F}_{\mathrm{DMD}}} \left\| \mathbf{P}_F^{\,l} - \mathbf{P}_{F_T}^{\,l} \right\|,
\tag{5.34}
$$
where L denotes the number of transformer layers and FDMD = {Q, K, V}. The loss term ℓhid is constructed in the same ℓ2-normalized form.
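The following PyTorch sketch mirrors Eqs. (5.32)–(5.34): it builds the ℓ2-normalized similarity pattern matrices from the student's and teacher's Q, K, V activations in each layer and sums their distances. The function names similarity_pattern and dmd_loss, the list-of-dicts input layout, and the use of the Frobenius norm for ∥·∥ in Eq. (5.34) are assumptions for illustration, not the authors' code.

```python
import torch


def similarity_pattern(x):
    """P = (X X^T) / ||X X^T||, Eq. (5.32); x has shape (seq_len, dim)."""
    s = x @ x.transpose(-1, -2)
    return s / s.norm()  # ℓ2 (Frobenius) normalization, assumed here


def dmd_loss(student_qkv, teacher_qkv):
    """ℓ_DMD of Eq. (5.34): sum over layers l and F ∈ {Q, K, V} of
    ||P_F^l - P_{F_T}^l||.

    student_qkv / teacher_qkv: lists (one entry per transformer layer)
    of dicts {"Q": ..., "K": ..., "V": ...} holding the distilled
    activations (hypothetical layout).
    """
    loss = 0.0
    for s_layer, t_layer in zip(student_qkv, teacher_qkv):
        for name in ("Q", "K", "V"):
            p_s = similarity_pattern(s_layer[name])
            p_t = similarity_pattern(t_layer[name])
            loss = loss + (p_s - p_t).norm()
    return loss
```

The overall ℓdistill of Eq. (5.33) would then add the hidden-state term ℓhid and the prediction term ℓpred on top of this quantity.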
The overall pipeline of BiBERT is shown in Fig. 5.15. The authors conducted experiments on the GLUE benchmark by binarizing various BERT-based pre-trained models. The results listed in Table 5.7 show that BiBERT surpasses BinaryBERT by a wide margin in average accuracy.